UNIVERSIDAD C0MPLUTENSE DE MADRID (2022-2023)¶

Final Master's Project:¶

ANALYSIS, FORECAST AND VISUALIZATION OF ENERGY MATRICES AND BALANCES¶

Company: MRC¶

Authors: GROUP 1 (Ruth, Angel, Alejandro, Eduardo, Nam, Antonio)¶

I. About the dataset¶

  • We use the dataset Energy balance matrix from 1970 to 2021 of Brazil. The dataset is downloaded in the website of Olade.

https://sielac.olade.org/

data here: https://github.com/NamNguyen2015/TFM/blob/main/datas/Option_B/Brazil_Energy%20balance%20matrix.xlsx

  • The calculations for data processing base on the Manual Olade 2021

https://github.com/NamNguyen2015/TFM/blob/main/reference_documents/Manual%20Olade%202011.pdf

Scope of work:¶

a. Background and rationale

Energy matrices are important tools in the field of energy planning and analysis. They provide a comprehensive overview of energy sources, consumption, and related factors within a particular region or system. They serve various purposes, including:

  • Allow policy makers to formulate and implement affective measures.

  • Help identify energy inefficiencies.

  • Play a crucial role in assessing the environmental impact of different energy sources.

b. Approach and methodology

The objective is to assess the evolution of an energy matrix for a specific country or region over time and develop a mid-term projection. In this context, data analysis plays a crucial role by:

  • Collecting and organizing the data in a structured format, ensuring its quality and consistency. This step is vital to establish a reliable foundation for further analysis.

  • Performing exploratory data analysis (EDA) to gain a preliminary understanding of the data. EDA helps uncover patterns, trends, and relationships within the energy matrix dataset. It provides insights into the variables at hand and informs subsequent analytical steps.

  • Building indicators to effectively evaluate and analyse the energy matrices. These indicators serve as measurable metrics that reflect specific aspects of the energy system, facilitating meaningful comparisons and assessments.

  • Developing a mid-term projection of the energy matrix using suitable machine learning algorithms and econometric techniques. By using historical data, economic and demographic projections, and applying appropriate modelling techniques, it is possible to project future trends in the energy matrix. These projections provide valuable insights into potential changes in energy sources, consumption patterns, and related factors. Decision-makers can utilize this information to formulate effective strategies and policies for the future, aiming for a more sustainable and resilient energy system.

  • Constructing visualizations such as Sankey charts and other relevant figures to analyse the energy matrices and present key information.1 Visual representations enhance the comprehension and communication of complex data, enabling stakeholders to grasp trends, interconnections, and potential areas for improvement more easily.

I.1 Load the dataset¶

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import os
<frozen importlib._bootstrap>:228: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
In [2]:
# Load the file
import requests
from io import BytesIO
#file_path = '../datas/Option_B/Brazil_Energy balance matrix.xlsx'
# Define the URL of the Excel file
url = "https://github.com/NamNguyen2015/TFM/raw/main/datas/Option_B/Brazil_Energy%20balance%20matrix.xlsx"

# Download the Excel file
response = requests.get(url)

# Check if the download was successful (status code 200)
if response.status_code == 200:
    # Read the Excel file into a Pandas DataFrame
    Dict = pd.read_excel(BytesIO(response.content), sheet_name=None, skiprows=range(4), skipfooter=3)
else:
    print("Failed to download the file")

#File excel with multi-sheet_names is read as an Dictionary (Read dirrectly from the file_path)
#Dict = pd.read_excel(file_path, sheet_name=None, skiprows=range(4), skipfooter=3)


# Create a new dictionary to store modified dataframes
modified_dict = {}

for k, df in Dict.items():
    
    # Round the values of each column in the dataframe with 2 decimals
    df = df.applymap(lambda x: round(x, 2) if isinstance(x, (int, float)) else x)
    
    # Ignore the first row (unit) in each df
    df = df.iloc[1:]

    # Rename columns in df
    df.rename(columns={'Unnamed: 0': 'SECTOR'}, inplace=True) 
    df.rename(columns={'OTHER PRIMARY_x000D_\n': 'OTHER PRIMARY'}, inplace=True)
    
    # Remove space in column names and Sector names
    df.columns = df.columns.str.strip()
    df['SECTOR'] = df['SECTOR'].str.strip()

    # Rename in a row
    df['SECTOR'].replace({'COKE PLANTS AND BLAST FURNACES_x000D_': 'COKE PLANTS AND BLAST FURNACES'}, inplace=True)
    # fill nule values
    df=df.fillna(np.nan)

    # Convert the keys of Dict to only contain the year
    k_new = k.split(' - ')[0]
    df["YEAR"] = k_new

    # Change all column names to uppercase if necessary
    df.columns = df.columns.str.upper()

    # Store the modified dataframe in the new dictionary
    modified_dict[k_new] = df
    
    # Change the unit from Ktoe to Mtoe:
    for column in df.columns:
        if pd.api.types.is_numeric_dtype(df[column]):
            df[column] = df[column].astype(float)
            df[column]=df[column]/1000
    
    
    # Round the values of each column in the dataframe with 2 decimals
    df = df.applymap(lambda x: round(x, 2) if isinstance(x, (int, float)) else x)

# Update the original dictionary with the modified dataframes
Dict = modified_dict


print(Dict.keys())

#Save the Clean Dictionary in Excel with multi-sheet_names
file_path='../datas/Datas_cleaned/Brazil_Energy balance matrix_cleaned.xlsx'
    
with pd.ExcelWriter(file_path) as writer:
    for k, df in Dict.items():
        df.to_excel(writer, sheet_name=k)
dict_keys(['1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021'])

I.2 First visualization of energy balance (Supply-Transformation-Losses-Consumption)¶

In [3]:
# Concatenate the data
for k in Dict.keys():    
     # Concatenate all DataFrames in Dict vertically
    df_concat = pd.concat(Dict.values(), ignore_index=True)
    df_concat.set_index("YEAR", inplace=True)
    df_concat.index = df_concat.index.astype(int)
    #df_concat.index = pd.to_datetime(df_concat.index, format='%Y')
    
df_concat.head()
Out[3]:
SECTOR OIL NATURAL GAS COAL HYDROENERGY GEOTHERMAL NUCLEAR FIREWOOD SUGARCANE AND PRODUCTS OTHER PRIMARY ... KEROSENE/JET FUEL DIESEL OIL FUEL OIL COKE CHARCOAL GASES OTHER SECONDARY NON-ENERGY TOTAL SECUNDARIES TOTAL
YEAR
1970 PRODUCTION 8.16158 1.10224 1.04443 3.42152 NaN NaN 31.85153 3.6007 0.22324 ... 1.30721 5.67461 8.39923 0.05518 1.76675 0.46595 0.32173 0.88677 31.06807 49.40524
1970 IMPORT 17.84505 NaN 1.35847 NaN NaN NaN NaN NaN NaN ... 0.00819 NaN NaN 0.07242 NaN NaN NaN 0.36892 0.93748 20.14101
1970 EXPORT 0.06505 NaN NaN NaN NaN NaN NaN NaN NaN ... 0.12291 0.04715 0.74834 NaN NaN NaN NaN NaN 0.92012 0.98517
1970 STOCK CHANGE -0.27733 NaN -0.16752 NaN NaN NaN NaN NaN NaN ... -0.05486 -0.04201 -0.06903 -0.02483 NaN NaN 0.01541 -0.05393 -0.34528 -0.79013
1970 UNUSED NaN 0.92222 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 0.05588 0.05045 NaN 0.10633 1.02854

5 rows × 24 columns

Show the data summary over years¶

In [4]:
# Group by Year and SECTOR, and sum the TOTAL values
selected_SECTORS=['TOTAL SUPPLY','TOTAL TRANSFORMATION','LOSSES','FINAL CONSUMPTION']
df_summary=df_concat.copy()

#df_summary=df_summary.apply(lambda column: column.fillna(0))
df_summary=df_summary[df_summary['SECTOR'].isin(selected_SECTORS)]
df_summary = df_summary.groupby(['YEAR','SECTOR'])['TOTAL'].sum().reset_index()
# Pivot the data
df_summary = df_summary.pivot(index='YEAR', columns='SECTOR', values='TOTAL')

# Reorder the columns in df
df_summary = df_summary[selected_SECTORS]

# Display the resultdf_summary
df_summary.head()
Out[4]:
SECTOR TOTAL SUPPLY TOTAL TRANSFORMATION LOSSES FINAL CONSUMPTION
YEAR
1970 66.74240 -5.42247 0.92920 58.83412
1971 70.09203 -5.43772 1.02493 61.69169
1972 75.11133 -5.93688 1.14603 65.99512
1973 81.96344 -6.48878 1.26921 71.69286
1974 87.62820 -6.95924 1.45878 76.19840

Visualize the data summary¶

In [5]:
# Create a figure and axis
fig, ax = plt.subplots(figsize=( 10,6))

for col in df_summary.columns:
    sns.lineplot(data=df_summary[col], x=df_summary.index, y=df_summary[col], label=col, ax=ax)
        
# Set labels and title
ax.set_xlabel('Year')
ax.set_ylabel(f'Values [Mtoe]')
ax.set_title(f'Total Supply, Transfomation, Losses, Final Consumption vs Year')
ax.set_xticks(df_summary.index.unique()[::5])

# Show the legend
ax.legend()

# Show the plot
plt.grid(False)
plt.show()

The negative transformation values over years typically means that more energy was consumed or transformed within the energy system than was produced or input into the system during that specific time period (year). In other words, it indicates an energy deficit or a situation where the energy demands and transformations within the system exceeded the energy available from primary sources or imports.

In [6]:
# check the Import- Export:

# Group by Year and SECTOR, and sum the TOTAL values
selected_SECTORS=['IMPORT','EXPORT']
df_=df_concat.copy()

#df_summary=df_summary.apply(lambda column: column.fillna(0))
df_=df_[df_['SECTOR'].isin(selected_SECTORS)]
df_= df_.groupby(['YEAR','SECTOR'])['TOTAL'].sum().reset_index()
# Pivot the data
df_= df_.pivot(index='YEAR', columns='SECTOR', values='TOTAL')

# Reorder the columns in df
df_= df_[selected_SECTORS]

# Display the resultdf_summary
df_.head()
Out[6]:
SECTOR IMPORT EXPORT
YEAR
1970 20.14101 0.98517
1971 23.24845 1.26902
1972 28.27661 2.65150
1973 37.87693 3.53330
1974 37.90801 3.48441
In [7]:
df_.plot(ylabel="Values [Mtoe]", title="Evolution of Import- Export")
Out[7]:
<AxesSubplot:title={'center':'Evolution of Import- Export'}, xlabel='YEAR', ylabel='Values [Mtoe]'>

It has been observed that in recent years, Brazil has become increasingly self-sufficient in energy production and less reliant on imports. Starting in 2015, the country also began exporting its energy production.

I.3 Sankey Diagram¶

  • Sankey diagrams are valuable tools for visualizing energy flows and understanding energy balances in complex systems.

  • Reviewing Sankey diagrams in an energy balance context is crucial for gaining insights into energy flows, identifying inefficiencies, ensuring a balance between supply and demand, and making informed decisions about energy policy, efficiency improvements, and environmental management. They provide a concise and visually intuitive representation of complex energy data, making it easier for stakeholders to understand and act upon energy-related challenges and opportunities.

In [8]:
from collections import defaultdict
import plotly.graph_objects as go

Define the Plot function¶

In [9]:
# Defining the Plot function
def Plot(year,db):
    label=db[year]["label"]
    source=db[year]["source"]
    target=db[year]["target"]
    value=db[year]["value"]
    color_nodes=db[year]["color_nodes"]
    color_links=db[year]["color_links"]
    fig = go.Figure(data=[go.Sankey(
        node = dict(
          pad = 30,
          thickness = 20,
          line = None, # dict(color = "black", width = 0.5),
          label = label,
          color = color_nodes
        ),
        link = dict(
          source = source, # indices correspond to labels, eg A1, A2, A1, B1, ...
          target = target,
          value = value,
            color=color_links
      ))])

    fig.update_layout(title_text=year, font_size=10)
    fig.show()

Preparation the data to plot¶

In [10]:
# Preparation the data to plot
def Data_Generate(Dict):
    
    Dict_out = {}  # Initialize the output dictionary
    for sheet_name in Dict.keys():

        df=Dict[sheet_name]
        
         # Reset index
        df=df.set_index('SECTOR')

        # Transpose df
        df=df.T

        # Fill NaN values
        df=df.fillna(np.nan)
       
        #print(df.head(5))
        # define the combinations
        Transformers=['REFINERIES', 'POWER PLANTS', 'SELF-PRODUCERS',
               'GAS PLANTS', 'CHARCOAL PLANTS', 'COKE PLANTS AND BLAST FURNACES',
               'DISTILLERIES', 'OTHER CENTERS']
        Primaries=['OIL','NATURAL GAS','COAL','HYDROENERGY','GEOTHERMAL','NUCLEAR','FIREWOOD','SUGARCANE AND PRODUCTS','OTHER PRIMARY']
        Secondaries=['ELECTRICITY','LPG','GASOLINE/ALCOHOL','KEROSENE/JET FUEL','DIESEL OIL','FUEL OIL','COKE','CHARCOAL','GASES','OTHER SECONDARY']
        Consumptions=['TRANSPORT','INDUSTRIAL','RESIDENTIAL','COMMERCIAL, SERVICES, PUBLIC','AGRICULTURE, FISHING AND MINING','CONSTRUCTION AND OTHERS']


        unique_combinations = []
          

        for i in Transformers:
            for j in Primaries:
                unique_combinations.append((j, i,abs(df[i][j])))

        for i in Transformers:
            for j in Secondaries:
                unique_combinations.append((i, j,abs(df[i][j])))

        # the final consumption column - Usage column
        for i in Consumptions:
            for j in Primaries+Secondaries:
                unique_combinations.append((j, i,abs(df[i][j])))
                

                
        label=Transformers+Primaries+Secondaries+Consumptions+Primaries+Secondaries

        
        #colors
        color_Transformers=['blue','yellow','green','orange','grey','grey','pink','cyan']
        color_Primaries=['black','orange','darkgray','lightblue','darkred','red','brown','darkgreen','khaki']
        color_Secondaries=['yellow','lightgreen','plum','plum','plum','grey','grey','grey','grey']
        color_Consumptions=['darkmagenta' for i in range(9)]
        color_nodes=color_Transformers+color_Primaries+color_Secondaries+color_Consumptions+color_Primaries+color_Secondaries
        
        # Rename the sheet_name to contain only year. Sample: "1970 - Brazil"--> "1970"
        sheet_name_new = sheet_name.split(' - ')[0]
    
        _dict=Dict_out[sheet_name_new]={}
        _dict["source"]=[]
        _dict["target"]=[]
        _dict["value"]=[]
        _dict["label"]=label
        _dict["color_nodes"]=color_nodes
        _dict["color_links"]=[]
        


        #Dict_out[sheet_name_new] = data  # Store the data in the dictionary
        for k in unique_combinations:
            _dict["source"].append(label.index(k[0]))
            _dict["target"].append(label.index(k[1]))
            _dict["value"].append(k[2])
            _dict["color_links"].append(color_nodes[label.index(k[0])])
    return Dict_out
In [11]:
db=Data_Generate(Dict)

Sample plot¶

In [12]:
Plot(year='2021',db=db)

II. DATA ANALYSIS CONSUMPTION¶

In [13]:
# Show the df final consumption:
df_C=df_concat.copy()

selected_SECTORS_C=['TRANSPORT', 'INDUSTRIAL', 'RESIDENTIAL',
       'COMMERCIAL, SERVICES, PUBLIC', 'AGRICULTURE, FISHING AND MINING',
       'CONSTRUCTION AND OTHERS',
       'NON-ENERGY CONSUMPTION','FINAL CONSUMPTION']

df_C=df_C[df_C['SECTOR'].isin(selected_SECTORS_C)]
df_C = df_C.groupby(['YEAR','SECTOR'])['TOTAL'].sum().reset_index()
# Pivot the data
df_C = df_C.pivot(index='YEAR', columns='SECTOR', values='TOTAL')

# Reorder the columns in df
df_C = df_C[selected_SECTORS_C]

# Display the resultdf_summary
df_C.head()
Out[13]:
SECTOR TRANSPORT INDUSTRIAL RESIDENTIAL COMMERCIAL, SERVICES, PUBLIC AGRICULTURE, FISHING AND MINING CONSTRUCTION AND OTHERS NON-ENERGY CONSUMPTION FINAL CONSUMPTION
YEAR
1970 12.66199 16.02092 22.07566 1.26680 5.35101 0.00000 1.45773 58.83412
1971 13.84187 17.45473 22.25396 1.40698 5.31739 0.00000 1.41675 61.69169
1972 15.63322 19.02699 22.44087 1.57069 5.33847 0.00000 1.98488 65.99512
1973 18.32890 21.49343 22.35413 1.73784 5.44191 0.00000 2.33665 71.69286
1974 20.10568 23.25701 22.31670 1.87346 5.37638 0.00715 3.26203 76.19840
In [14]:
# df_C_percentages calculate the percentages

df_C_percentages = (df_C.drop(columns=['FINAL CONSUMPTION']).div(df_C['FINAL CONSUMPTION'], axis=0) * 100)
df_C_percentages = df_C_percentages.round(0)

df_C_percentages.columns
Out[14]:
Index(['TRANSPORT', 'INDUSTRIAL', 'RESIDENTIAL',
       'COMMERCIAL, SERVICES, PUBLIC', 'AGRICULTURE, FISHING AND MINING',
       'CONSTRUCTION AND OTHERS', 'NON-ENERGY CONSUMPTION'],
      dtype='object', name='SECTOR')
In [15]:
# Drop the 'FINAL CONSUMPTION' column (if not already dropped) as it is not needed for the stack plot
df_stackplot = df_C.drop(columns=["FINAL CONSUMPTION"])

a- Show the line_plot:¶

In [16]:
data=df_stackplot.copy()

# Line_Plot:
fig, ax1 = plt.subplots(1, 1, figsize=(10, 6))

data.plot(ax=ax1)
ax1.set_xlabel('Year')
ax1.set_ylabel('Energy Consumption [Mtoe]')
ax1.set_title('Energy Consumption Over the Years')
ax1.legend(loc='upper left')

plt.show()

b- Plot both stack plots side by side¶

In [17]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(18, 6))

df_stackplot.plot.area(stacked=True, alpha=0.7, ax=ax1)
ax1.set_xlabel('Year')
ax1.set_ylabel('Energy Consumption [Mtoe]')
ax1.set_title('Energy Consumption Over the Years')
ax1.legend(loc='upper left', bbox_to_anchor=(1.0, 1.0))

df_C_percentages.plot.area(stacked=True, alpha=0.7, ax=ax2)

ax2.set_xlabel('Year')
ax2.set_ylabel(' Energy Consumption  (%)')
ax2.set_title('Percentage of  Energy Consumption Over the Years')
ax2.legend(loc='upper left', bbox_to_anchor=(1.0, 1.0))

plt.tight_layout()
plt.show()

CORRELATION MATRIX¶

In [18]:
list_corr=df_C.columns.drop(['FINAL CONSUMPTION'])
In [19]:
corr_mat = df_C[list_corr].corr()
sns.heatmap(corr_mat,annot=True, fmt=".2f", cmap="coolwarm", linewidths=.5)
plt.show()

These correlation coefficients can range from -1 to 1, with values closer to 1 indicating a strong positive correlation, values closer to -1 indicating a strong negative correlation, and values close to 0 indicating little to no linear correlation.

  • Positive Correlation (r > 0): When the correlation coefficient is positive (greater than 0), it indicates that as one variable increases, the other variable tends to increase as well

  • Negative correlations can be just as meaningful as positive correlations in data analysis. They indicate that as one variable increases, the other tends to decrease,

From the correlation matrix, we can observe that the sector 'CONSTRUCTION AND OTHERS' has a very low correlation with other sectors. We can consider dropping it if necessary.

In [20]:
df_C.drop(columns=['CONSTRUCTION AND OTHERS'], inplace=True)
In [21]:
df_C.columns
Out[21]:
Index(['TRANSPORT', 'INDUSTRIAL', 'RESIDENTIAL',
       'COMMERCIAL, SERVICES, PUBLIC', 'AGRICULTURE, FISHING AND MINING',
       'NON-ENERGY CONSUMPTION', 'FINAL CONSUMPTION'],
      dtype='object', name='SECTOR')

Scaled data¶

MinMaxScaler¶

In [22]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df_C)
scaled_df = pd.DataFrame(scaled_data, columns=df_C.columns)
scaled_df.set_index(df_C.index, inplace=True)
In [23]:
scaled_df.head()
Out[23]:
SECTOR TRANSPORT INDUSTRIAL RESIDENTIAL COMMERCIAL, SERVICES, PUBLIC AGRICULTURE, FISHING AND MINING NON-ENERGY CONSUMPTION FINAL CONSUMPTION
YEAR
1970 0.000000 0.000000 0.395950 0.000000 0.003327 0.002654 0.000000
1971 0.016081 0.020678 0.412922 0.011833 0.000000 0.000000 0.015970
1972 0.040495 0.043352 0.430713 0.025652 0.002086 0.036794 0.040021
1973 0.077235 0.078921 0.422457 0.039761 0.012323 0.059575 0.071865
1974 0.101450 0.104355 0.418894 0.051209 0.005838 0.119505 0.097046

ARIMA models for non-stationary time series¶

We apply the ARIMA model to all sectors, but first, we need to perform some checks to select the best parameters (p, d, q).

In time series analysis, whether we need to scale the data (i.e., standardize or normalize it) depends on the specific characteristics of the data and the modeling techniques we plan to use. Scaling may or may not be required.

In this section, we are focusing on modeling the univariate problem.

In [24]:
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from sklearn.metrics import mean_squared_error, r2_score

Prepare Functions¶

In [25]:
from sklearn.metrics import mean_absolute_percentage_error 

def MAPE(actual,forecast):
      # Ensure both DataFrames have the same shape
    if actual.shape != forecast.shape:
        raise ValueError("Input data shapes must match.")

    # Replace zero values in y_true with a small non-zero value to avoid division by zero
    actual = actual.replace(0, 1e-10)

    # Calculate the mean of absolute percentage errors 
    mape= round(abs((actual.to_numpy() - forecast.to_numpy()) / actual.to_numpy()).mean()*100,2)

    return mape
In [26]:
from statsmodels.tsa.stattools import adfuller

def test_stationarity(timeseries):
    
    #Perform Dickey-Fuller test:
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    print(dfoutput)
    
In [27]:
# Función para evaluar residuos a través de contrastes de hipótesis
def residcheck(residuals, lags):
    """
    Function to check if the residuals are white noise. Ideally the residuals should be uncorrelated, zero mean, 
    constant variance and normally distributed. First two are must, while last two are good to have. 
    If the first two are not met, we have not fully captured the information from the data for prediction. 
    Consider different model and/or add exogenous variable. 
        
    If Ljung Box test shows p> 0.05, the residuals as a group are white noise. Some lags might still be significant. 
        
    Lags should be min(2*seasonal_period, T/5)
        
    plots from: https://tomaugspurger.github.io/modern-7-timeseries.html
        
    """
    resid_mean = np.mean(residuals)
    lj_p_val = np.mean(sm.stats.acorr_ljungbox(x=residuals, lags=lags).lb_pvalue)
    norm_p_val =  stats.jarque_bera(residuals)[1]
    adfuller_p = adfuller(residuals)[1]
        
      
    fig = plt.figure(figsize=(10,8))
    layout = (2, 2)
    ts_ax = plt.subplot2grid(layout, (0, 0), colspan=2);
    acf_ax = plt.subplot2grid(layout, (1, 0));
    kde_ax = plt.subplot2grid(layout, (1, 1));
    
    residuals.plot(ax=ts_ax)
    plot_acf(residuals, lags=lags, ax=acf_ax);
    sns.kdeplot(residuals);
    #[ax.set_xlim(1.5) for ax in [acf_ax, kde_ax]]
    sns.despine()
    plt.tight_layout();
    plt.show()
    print("** Mean of the residuals: ", np.around(resid_mean,2))
        
    print("\n** Ljung Box Test, p-value:", np.around(lj_p_val,3), 
        "(>0.05, Uncorrelated)" if (lj_p_val > 0.05) else "(<0.05, Correlated)")
        
    print("\n** Jarque Bera Normality Test, p_value:", np.around(norm_p_val,3),
        "(>0.05, Normal)" if (norm_p_val>0.05) else "(<0.05, Not-normal)")
        
    print("\n** AD Fuller, p_value:", np.around(adfuller_p,3), 
        "(>0.05, Non-stationary)" if (adfuller_p > 0.05) else "(<0.05, Stationary)")
    
    return ts_ax, acf_ax, kde_ax   

Find the Best Models by chossing the smallest values of AIC and BIC¶

In [28]:
import statsmodels.api as sm
def order_aic_bic(df,d):
    order_aic_bic =[]

    # Loop over p values from 0-4
    for p in range(5):
        # Loop over q values from 0-4
        for q in range(5):

            try:
                # create and fit ARMA(p,q) model
                
                model = ARIMA(df, order=(p, d, q))
                results = model.fit()

                # Print order and results
                order_aic_bic.append((p, q, results.aic, results.bic))            
            except:
                print(p, q, None, None)

    # Make DataFrame of model order and AIC/BIC scores
    order_df = pd.DataFrame(order_aic_bic, columns=['p', 'q', 'aic','bic'])

    # lets sort them by AIC and BIC

    # Sort by AIC
    print("Models sorted by AIC ")
    print("\n")
    print(order_df.sort_values('aic').reset_index(drop=True)[:5])

    # Sort by BIC
    print("Models sorted by BIC ")
    print("\n")
    print(order_df.sort_values('bic').reset_index(drop=True)[:5])

II.2. Generate for all sectors¶

Step 1: Manually check to chose d:¶

In this step, we first check the stationarity of a time series by performing a statistical test, such as the Augmented Dickey-Fuller (ADF) test, and examining the p-value associated with the test. If the p-value is smaller than 0.05, we can consider the series as stationary; otherwise, we make the series stationary by applying differencing. The minimum order of differencing is the value of d.

Additionally, we plot the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) at each differencing order. This step allows us to select the appropriate values for 'p' and 'q' based on the characteristics of the ACF and PACF plots.

In [29]:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
for s in scaled_df.columns:
    # Separate data (data used: scaled_df)
    
    train_size=int(len(df_C) * 0.90)
    df_tr=df_C[s].iloc[:train_size]#train[['TRANSPORT']]
    df_val=df_C[s].iloc[train_size:]
    
    # test stationary for the original data:
    print(f'{s}-Test the original')
    test_stationarity(df_tr)
    
    # Differencing 1st:
    train_diff = df_tr.diff().dropna()
    # Test stationary
    print(f'{s}-Test the 1st')
    test_stationarity(train_diff)
    # Differencing 2nd:
    train_diff_2=df_tr.diff().diff().dropna()
    # Test starionary
    print(f'{s}-Test the 2nd')
    test_stationarity(train_diff_2)
    
    plt.rcParams.update({'figure.figsize': (9, 6), 'figure.dpi': 120})

    fig, axes = plt.subplots(3, 2, sharex=True)

    # Plot ACF for original data
    plot_acf(df_tr, ax=axes[0, 0])
    axes[0, 0].set_title(f'{s}-ACF-original')
    axes[0, 0].set_xlabel('Lag')
    axes[0, 0].set_ylabel('ACF')

    # Plot PACF for train
    plot_pacf(df_tr, ax=axes[0, 1], method='ywm')
    axes[0, 1].set_title(f'{s}-PACF-original')
    axes[0, 1].set_xlabel('Lag')
    axes[0, 1].set_ylabel('PACF')


    # Plot ACF for train_diff
    plot_acf(train_diff, ax=axes[1, 0])
    axes[1, 0].set_title(f'{s}-ACF for 1st differencing')
    axes[1, 0].set_xlabel('Lag')
    axes[1, 0].set_ylabel('ACF')

    # Plot PACF for train_diff
    plot_pacf(train_diff, ax=axes[1, 1], method='ywm')
    axes[1, 1].set_title(f'{s}-PACF for 1st differencing')
    axes[1, 1].set_xlabel('Lag')
    axes[1, 1].set_ylabel('PACF')

    # Plot ACF for train_diff_2
    plot_acf(train_diff_2, ax=axes[2, 0])
    axes[2, 0].set_title(f'{s}-ACF for 2nd differencing')
    axes[2, 0].set_xlabel('Lag')
    axes[2, 0].set_ylabel('ACF')

    # Plot PACF for train_diff_2
    plot_pacf(train_diff_2, ax=axes[2, 1], method='ywm')
    axes[2, 1].set_title(f'{s} PACF for 2nd differencing')
    axes[2, 1].set_xlabel('Lag')
    axes[2, 1].set_ylabel('PACF')

    plt.tight_layout()
    plt.show()
    
    
    print(f'*************************************')
TRANSPORT-Test the original
Results of Dickey-Fuller Test:
Test Statistic                  1.478493
p-value                         0.997443
#Lags Used                      7.000000
Number of Observations Used    38.000000
Critical Value (1%)            -3.615509
Critical Value (5%)            -2.941262
Critical Value (10%)           -2.609200
dtype: float64
TRANSPORT-Test the 1st
Results of Dickey-Fuller Test:
Test Statistic                 -2.756864
p-value                         0.064704
#Lags Used                      6.000000
Number of Observations Used    38.000000
Critical Value (1%)            -3.615509
Critical Value (5%)            -2.941262
Critical Value (10%)           -2.609200
dtype: float64
TRANSPORT-Test the 2nd
Results of Dickey-Fuller Test:
Test Statistic                 -3.444595
p-value                         0.009529
#Lags Used                      4.000000
Number of Observations Used    39.000000
Critical Value (1%)            -3.610400
Critical Value (5%)            -2.939109
Critical Value (10%)           -2.608063
dtype: float64
*************************************
INDUSTRIAL-Test the original
Results of Dickey-Fuller Test:
Test Statistic                 -0.449785
p-value                         0.901449
#Lags Used                      0.000000
Number of Observations Used    45.000000
Critical Value (1%)            -3.584829
Critical Value (5%)            -2.928299
Critical Value (10%)           -2.602344
dtype: float64
INDUSTRIAL-Test the 1st
Results of Dickey-Fuller Test:
Test Statistic                 -5.317197
p-value                         0.000005
#Lags Used                      1.000000
Number of Observations Used    43.000000
Critical Value (1%)            -3.592504
Critical Value (5%)            -2.931550
Critical Value (10%)           -2.604066
dtype: float64
INDUSTRIAL-Test the 2nd
Results of Dickey-Fuller Test:
Test Statistic                 -3.912924
p-value                         0.001942
#Lags Used                      6.000000
Number of Observations Used    37.000000
Critical Value (1%)            -3.620918
Critical Value (5%)            -2.943539
Critical Value (10%)           -2.610400
dtype: float64
*************************************
RESIDENTIAL-Test the original
Results of Dickey-Fuller Test:
Test Statistic                  0.213308
p-value                         0.972998
#Lags Used                      0.000000
Number of Observations Used    45.000000
Critical Value (1%)            -3.584829
Critical Value (5%)            -2.928299
Critical Value (10%)           -2.602344
dtype: float64
RESIDENTIAL-Test the 1st
Results of Dickey-Fuller Test:
Test Statistic                 -5.425666
p-value                         0.000003
#Lags Used                      0.000000
Number of Observations Used    44.000000
Critical Value (1%)            -3.588573
Critical Value (5%)            -2.929886
Critical Value (10%)           -2.603185
dtype: float64
RESIDENTIAL-Test the 2nd
Results of Dickey-Fuller Test:
Test Statistic                -7.081189e+00
p-value                        4.661825e-10
#Lags Used                     2.000000e+00
Number of Observations Used    4.100000e+01
Critical Value (1%)           -3.600983e+00
Critical Value (5%)           -2.935135e+00
Critical Value (10%)          -2.605963e+00
dtype: float64
*************************************
COMMERCIAL, SERVICES, PUBLIC-Test the original
Results of Dickey-Fuller Test:
Test Statistic                  2.752582
p-value                         1.000000
#Lags Used                      0.000000
Number of Observations Used    45.000000
Critical Value (1%)            -3.584829
Critical Value (5%)            -2.928299
Critical Value (10%)           -2.602344
dtype: float64
COMMERCIAL, SERVICES, PUBLIC-Test the 1st
Results of Dickey-Fuller Test:
Test Statistic                 -5.187659
p-value                         0.000009
#Lags Used                      0.000000
Number of Observations Used    44.000000
Critical Value (1%)            -3.588573
Critical Value (5%)            -2.929886
Critical Value (10%)           -2.603185
dtype: float64
COMMERCIAL, SERVICES, PUBLIC-Test the 2nd
Results of Dickey-Fuller Test:
Test Statistic                 -5.083537
p-value                         0.000015
#Lags Used                      3.000000
Number of Observations Used    40.000000
Critical Value (1%)            -3.605565
Critical Value (5%)            -2.937069
Critical Value (10%)           -2.606986
dtype: float64
*************************************
AGRICULTURE, FISHING AND MINING-Test the original
Results of Dickey-Fuller Test:
Test Statistic                  1.209344
p-value                         0.996050
#Lags Used                      0.000000
Number of Observations Used    45.000000
Critical Value (1%)            -3.584829
Critical Value (5%)            -2.928299
Critical Value (10%)           -2.602344
dtype: float64
AGRICULTURE, FISHING AND MINING-Test the 1st
Results of Dickey-Fuller Test:
Test Statistic                -6.949322e+00
p-value                        9.785090e-10
#Lags Used                     0.000000e+00
Number of Observations Used    4.400000e+01
Critical Value (1%)           -3.588573e+00
Critical Value (5%)           -2.929886e+00
Critical Value (10%)          -2.603185e+00
dtype: float64
AGRICULTURE, FISHING AND MINING-Test the 2nd
Results of Dickey-Fuller Test:
Test Statistic                -6.809117e+00
p-value                        2.138701e-09
#Lags Used                     2.000000e+00
Number of Observations Used    4.100000e+01
Critical Value (1%)           -3.600983e+00
Critical Value (5%)           -2.935135e+00
Critical Value (10%)          -2.605963e+00
dtype: float64
*************************************
NON-ENERGY CONSUMPTION-Test the original
Results of Dickey-Fuller Test:
Test Statistic                 -2.195579
p-value                         0.207854
#Lags Used                     10.000000
Number of Observations Used    35.000000
Critical Value (1%)            -3.632743
Critical Value (5%)            -2.948510
Critical Value (10%)           -2.613017
dtype: float64
NON-ENERGY CONSUMPTION-Test the 1st
Results of Dickey-Fuller Test:
Test Statistic                 -2.172822
p-value                         0.216257
#Lags Used                     10.000000
Number of Observations Used    34.000000
Critical Value (1%)            -3.639224
Critical Value (5%)            -2.951230
Critical Value (10%)           -2.614447
dtype: float64
NON-ENERGY CONSUMPTION-Test the 2nd
Results of Dickey-Fuller Test:
Test Statistic                 -2.724450
p-value                         0.069902
#Lags Used                      9.000000
Number of Observations Used    34.000000
Critical Value (1%)            -3.639224
Critical Value (5%)            -2.951230
Critical Value (10%)           -2.614447
dtype: float64
*************************************
FINAL CONSUMPTION-Test the original
Results of Dickey-Fuller Test:
Test Statistic                  3.446750
p-value                         1.000000
#Lags Used                      9.000000
Number of Observations Used    36.000000
Critical Value (1%)            -3.626652
Critical Value (5%)            -2.945951
Critical Value (10%)           -2.611671
dtype: float64
FINAL CONSUMPTION-Test the 1st
Results of Dickey-Fuller Test:
Test Statistic                 -1.457273
p-value                         0.554532
#Lags Used                      8.000000
Number of Observations Used    36.000000
Critical Value (1%)            -3.626652
Critical Value (5%)            -2.945951
Critical Value (10%)           -2.611671
dtype: float64
FINAL CONSUMPTION-Test the 2nd
Results of Dickey-Fuller Test:
Test Statistic                 -4.365760
p-value                         0.000341
#Lags Used                      6.000000
Number of Observations Used    37.000000
Critical Value (1%)            -3.620918
Critical Value (5%)            -2.943539
Critical Value (10%)           -2.610400
dtype: float64
*************************************

Step 2: Find the best (p,q) manually¶

This step is optional, as p and q can also be selected by finding the minimum values of AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) if d is already determined.

In [30]:
import statsmodels.api as sm
def order_aic_bic(df,d):
    order_aic_bic =[]

    # Loop over p values from 0-4
    for p in range(5):
        # Loop over q values from 0-4
        for q in range(5):

            try:
                # create and fit ARMA(p,q) model
                
                model = ARIMA(df, order=(p, d, q))
                results = model.fit()
                
                # Print order and results
                order_aic_bic.append((p, q, results.aic, results.bic))            
            except:
                print(p, q, None, None)

    # Make DataFrame of model order and AIC/BIC scores
    order_df = pd.DataFrame(order_aic_bic, columns=['p', 'q', 'aic','bic'])

    # lets sort them by AIC and BIC

    # Sort by AIC
    print("Models sorted by AIC ")
    #print("\n")
    print(order_df.sort_values('aic').reset_index(drop=True)[:5])

    # Sort by BIC
    print("Models sorted by BIC ")
    #print("\n")
    print(order_df.sort_values('bic').reset_index(drop=True)[:5])

Generate for all sectors¶

In [31]:
import warnings

# Set the warning filter to 'ignore' to suppress all warnings
warnings.filterwarnings("ignore")
In [32]:
for s in scaled_df.columns:
    
    # prepare data:
    # Split size of data   
    train_size=int(len(scaled_df) * 0.9) 
    df_sector= scaled_df[[s]]
   
    # Convert index to datetime format
    df_sector.index = pd.to_datetime(df_sector.index, format='%Y')
    
    # Set the frequency to 'AS' (Annual Start)
    df_sector = df_sector.asfreq('AS')
    # Train and Validate
    df_tr=df_sector.iloc[:train_size]
    df_val=df_sector.iloc[train_size:]
       
    if s=='TRANSPORT':
        print(s)
        d=2
        order_aic_bic(df_tr,d)
        print('****************')
           
        
    if s=='INDUSTRIAL':
        print(s)
        d=1
        order_aic_bic(df_tr,d)
        print('****************')
        
    if s=='RESIDENTIAL':
        print(s)
        d=2
        order_aic_bic(df_tr,d)
        print('****************')
        
    if s=='COMMERCIAL, SERVICES, PUBLIC':
        print(s)
        d=1
        order_aic_bic(df_tr,d)
        print('****************')
         
    if s=='AGRICULTURE, FISHING AND MINING':
        print(s)
        d=1
        order_aic_bic(df_tr,d)
        print('****************')
       
        
    if s=='NON-ENERGY CONSUMPTION':
        print(s)
        d=2
        order_aic_bic(df_tr,d)
        print('****************')
        
    if s=='FINAL CONSUMPTION':
        print(s)
        d=2
        order_aic_bic(df_tr,d)

      
TRANSPORT
Models sorted by AIC 
   p  q         aic         bic
0  0  1 -190.968716 -187.400337
1  1  1 -190.143405 -184.790837
2  0  2 -189.703032 -184.350463
3  4  4 -189.639963 -173.582256
4  4  3 -189.633727 -175.360209
Models sorted by BIC 
   p  q         aic         bic
0  0  1 -190.968716 -187.400337
1  1  1 -190.143405 -184.790837
2  0  2 -189.703032 -184.350463
3  1  0 -186.510659 -182.942280
4  2  0 -188.215102 -182.862533
****************
INDUSTRIAL
Models sorted by AIC 
   p  q         aic         bic
0  1  1 -171.636256 -166.216268
1  2  4 -171.597189 -158.950552
2  3  1 -171.030961 -161.997648
3  1  3 -171.026388 -161.993075
4  1  4 -170.385792 -159.545817
Models sorted by BIC 
   p  q         aic         bic
0  1  1 -171.636256 -166.216268
1  1  0 -166.985884 -163.372559
2  0  1 -166.875175 -163.261850
3  1  2 -169.889531 -162.662881
4  3  0 -169.683135 -162.456485
****************
RESIDENTIAL
Models sorted by AIC 
   p  q         aic         bic
0  0  1 -135.432786 -131.864407
1  0  2 -133.722598 -128.370029
2  1  1 -133.621598 -128.269029
3  2  1 -133.005693 -125.868935
4  0  3 -132.660096 -125.523337
Models sorted by BIC 
   p  q         aic         bic
0  0  1 -135.432786 -131.864407
1  0  2 -133.722598 -128.370029
2  1  1 -133.621598 -128.269029
3  2  1 -133.005693 -125.868935
4  0  3 -132.660096 -125.523337
****************
COMMERCIAL, SERVICES, PUBLIC
Models sorted by AIC 
   p  q         aic         bic
0  1  1 -236.786061 -231.366073
1  3  3 -235.753374 -223.106736
2  1  3 -235.011424 -225.978111
3  3  4 -234.575443 -220.122143
4  2  0 -233.167458 -227.747470
Models sorted by BIC 
   p  q         aic         bic
0  1  1 -236.786061 -231.366073
1  2  0 -233.167458 -227.747470
2  1  3 -235.011424 -225.978111
3  2  1 -232.788038 -225.561388
4  3  0 -232.150323 -224.923674
****************
AGRICULTURE, FISHING AND MINING
Models sorted by AIC 
   p  q         aic         bic
0  4  3 -128.328802 -113.875502
1  1  1 -127.907744 -122.487757
2  0  0 -127.596040 -125.789377
3  4  2 -127.204758 -114.558120
4  3  4 -126.966579 -112.513279
Models sorted by BIC 
   p  q         aic         bic
0  0  0 -127.596040 -125.789377
1  1  1 -127.907744 -122.487757
2  1  0 -125.831020 -122.217695
3  0  1 -125.819766 -122.206441
4  1  2 -126.698078 -119.471428
****************
NON-ENERGY CONSUMPTION
Models sorted by AIC 
   p  q        aic        bic
0  0  1 -72.363124 -68.794744
1  0  2 -70.421991 -65.069422
2  1  1 -70.417641 -65.065072
3  1  2 -69.226941 -62.090182
4  0  3 -68.463684 -61.326925
Models sorted by BIC 
   p  q        aic        bic
0  0  1 -72.363124 -68.794744
1  0  2 -70.421991 -65.069422
2  1  1 -70.417641 -65.065072
3  1  2 -69.226941 -62.090182
4  0  3 -68.463684 -61.326925
****************
FINAL CONSUMPTION
Models sorted by AIC 
   p  q         aic         bic
0  0  1 -201.002451 -197.434072
1  2  2 -199.419306 -190.498358
2  1  3 -199.401370 -190.480422
3  2  4 -199.158688 -186.669361
4  1  4 -199.144631 -188.439494
Models sorted by BIC 
   p  q         aic         bic
0  0  1 -201.002451 -197.434072
1  1  1 -199.041726 -193.689157
2  0  2 -199.029497 -193.676928
3  0  3 -198.721459 -191.584701
4  1  2 -198.600094 -191.463336

Step 3: Generate the model and make the prediction¶

In [33]:
# Validate model and check the error

df_C_pred=pd.DataFrame()
for s in scaled_df.columns: # Use caled data
    
    # prepare data:
    # Split size of data   
    train_size=int(len(scaled_df) * 0.9) 
    df_sector= scaled_df[[s]]
    df_sector.index = pd.to_datetime(df_sector.index, format='%Y')
    # Train and Validate
    df_tr=df_sector.iloc[:train_size]
    df_val=df_sector.iloc[train_size:]
       
    
    # Build model ARIMA with oder chose manually from previous steps
    if s=='TRANSPORT':
        model_val=ARIMA(df_tr, order=(4,2,3))
       
    if s=='INDUSTRIAL':
        model_val = ARIMA(df_tr, order=(2,1,2))
       
    if s=='RESIDENTIAL':
        model_val = ARIMA(df_tr, order=(2,2,2)) 
       
    if s=='COMMERCIAL, SERVICES, PUBLIC':
        model_val = ARIMA(df_tr, order=(4,1,4))
             
    if s=='AGRICULTURE, FISHING AND MINING':
        model_val = ARIMA(df_tr, order=(2,1,1))
       
    if s=='NON-ENERGY CONSUMPTION':
        model_val = ARIMA(df_tr, order=(4,2,3))
        
    if s=='FINAL CONSUMPTION':
        model_val = ARIMA(df_tr, order=(3,2,3))
        
    
    model_val_fit=model_val.fit()

    
    
    # Validation:
    
    forecast_val = model_val_fit.predict(start=df_val.index[0], end=df_val.index[-1])

    # Create a Pandas DataFrame with the forecasted values and set the index
    df_forecast_val = pd.DataFrame(data=forecast_val, index=df_val.index)

    mape_val=MAPE(df_val, df_forecast_val)
    # Plot

    plt.plot(df_sector, "b", label= "Actual")
    plt.plot(df_forecast_val, "orange", linestyle="--", label="Predict the validation data")
    plt.title(f'{s}-MAPE={mape_val}')
    
    # Save the figure to a file (e.g., as a PNG)
    folder_path="../Images/Validation"
    file_name = f'{s}_val.png'
    file_path = folder_path + '/' + file_name
    plt.savefig(file_path)

    plt.show()

    
    
In [34]:
df_C_pred=pd.DataFrame()
for s in scaled_df.columns: # Use caled data
    
    # prepare data:
    # Split size of data   
    train_size=int(len(scaled_df) * 0.9) 
    df_sector= scaled_df[[s]]
    df_sector.index = pd.to_datetime(df_sector.index, format='%Y')
    # Train and Validate
    df_tr=df_sector.iloc[:train_size]
    df_val=df_sector.iloc[train_size:]
       
    
    # Build model ARIMA with oder chose manually from previous steps
    if s=='TRANSPORT':       
        model = ARIMA(df_sector, order=(4,2,3))
    if s=='INDUSTRIAL':
        model = ARIMA(df_sector, order=(2,1,2))
    if s=='RESIDENTIAL':
        model = ARIMA(df_sector, order=(2,2,2)) 
    if s=='COMMERCIAL, SERVICES, PUBLIC':
        model = ARIMA(df_sector, order=(4,1,4))       
    if s=='AGRICULTURE, FISHING AND MINING':
        model = ARIMA(df_sector, order=(2,1,1))
    if s=='NON-ENERGY CONSUMPTION':
        model = ARIMA(df_sector, order=(4,2,3))
    if s=='FINAL CONSUMPTION':   
        model = ARIMA(df_sector, order=(3,2,3))
    
    
    model_fit = model.fit()
    
    
    #Predict:
    
    # Specify the number of future steps to forecast
    periods = 20  # Adjust as needed
    
    # Specify the start date
    start_date = pd.to_datetime('2022-01-01')

    # Prediction for future time points
    data_pred = model_fit.predict(start=start_date, end=start_date+ pd.DateOffset(years=periods))

    # Create a date range for the forecasted period
    forecast_index = pd.date_range(start=start_date, periods=periods+1, freq='A')

    # Create a Pandas DataFrame with the forecasted values and set the index
    df_forecast = pd.DataFrame(data=list(data_pred), index=forecast_index, columns=[s])

    # Plot
    plt.plot(df_sector, "b", label="Actual")
    plt.plot(df_forecast, "r", label="Predict 20 more years")

    #plt.title(f"{s}")
    plt.title(f'{s}')
    plt.legend()
    # Define the folder path where you want to save the image
    folder_path = "../Images/Prediction"
    file_name = f'{s}_pred.png'
    file_path = folder_path + '/' + file_name
    plt.savefig(file_path)


    # Display the figure in the Jupyter Notebook
    plt.show()
    
    #Concatenate df
    df_C_pred=pd.concat([df_C_pred,df_forecast],axis=1)
# Round the values of each column in the dataframe with 2 decimals
df_C_pred = df_C_pred.applymap(lambda x: round(x, 2) if isinstance(x, (int, float)) else x)  
df_C_pred.index=df_C_pred.index.year
In [35]:
df_C_final=pd.concat([scaled_df, df_C_pred], axis=0)
In [36]:
df_C_final.index=df_C_final.index.astype(int)

# Invert the scaled values in your DataFrame back to the original data
df_C_final = pd.DataFrame(scaler.inverse_transform(df_C_final), columns=df_C_final.columns, index=df_C_final.index)
df_C_final=df_C_final.drop("FINAL CONSUMPTION", axis=1)
In [37]:
df_tot = df_C_final.sum(axis=1)
df_C_final['TOT']=df_tot
In [38]:
df_C_final.plot(figsize=(10, 6),title= "Prediction FINAL ENERGY CONSUMPTION 20 more years", xlabel= "Year", ylabel="Values [Mtoe]")
Out[38]:
<AxesSubplot:title={'center':'Prediction FINAL ENERGY CONSUMPTION 20 more years'}, xlabel='Year', ylabel='Values [Mtoe]'>
In [39]:
df_C_final=df_C_final.applymap(lambda x: round(x, 2) if isinstance(x, (int, float)) else x) 
# save the data:
file_path='../datas/Data_forecast/predict_Final_Consumption.xlsx'
df_C_final.to_excel(file_path)
In [40]:
df_C_final_percentages = (df_C_final.drop(columns=['TOT']).div(df_C_final['TOT'], axis=0) * 100)
In [41]:
df_C_final_stackplot=df_C_final.drop(columns=['TOT'])
In [42]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(18, 6))

df_C_final_stackplot.plot.area(stacked=True, alpha=0.7, ax=ax1)
ax1.set_xlabel('Year')
ax1.set_ylabel('Energy Consumption [Mtoe]')
ax1.set_title('Energy Consumption Over the Years')
ax1.legend(loc='upper left', bbox_to_anchor=(1.0, 1.0))

df_C_final_percentages.plot.area(stacked=True, alpha=0.7, ax=ax2)

ax2.set_xlabel('Year')
ax2.set_ylabel(' Energy Consumption  (%)')
ax2.set_title('Percentage of  Energy Consumption Over the Years')
ax2.legend(loc='upper left', bbox_to_anchor=(1.0, 1.0))

plt.tight_layout()
plt.show()
In [43]:
# Filter the DataFrame to include only data for every 5 years
filtered_df = df_C_final[df_C_final.index % 5 == 0]
# Create a bar plot
ax = filtered_df['TOT'][8:].plot(x='Year', y='TOT', kind='bar', color=['b']* (len(filtered_df)-(8+4))  + ['r']*4 )

# Add value annotations to the bars
for index, value in enumerate(filtered_df['TOT'][8:]):
    ax.annotate(str(value), xy=(index, value), ha='center', va='bottom')

plt.xlabel('Year')
plt.ylabel('Total Consumption [Mtoe]')
plt.title('Sum over total sectorials')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability

plt.tight_layout()
plt.show()
In [44]:
# Save this df for further usage

file_path='../datas/Data_forecast/final_C.csv'
    
#final_df.to_excel(file_path)

df_C_final[-23:].to_csv(file_path)
In [45]:
# Check the value in a given year:
year=2028
value=df_C_final.loc[year,'TOT']
print(f'Total consumption in year  {year} is {value}')
Total consumption in year  2028 is 257.85

Copare with the result in Website: eneroutlook

https://eneroutlook.enerdata.net/brazil-energy-forecast.html

image.png

III. Scenarios:¶

In [46]:
final_consumption_data = df_concat[df_concat['SECTOR'] == 'FINAL CONSUMPTION']
final_consumption_data=final_consumption_data.drop(columns=['SECTOR'])
In [47]:
final_consumption_data.iloc[:, :10].head()
Out[47]:
OIL NATURAL GAS COAL HYDROENERGY GEOTHERMAL NUCLEAR FIREWOOD SUGARCANE AND PRODUCTS OTHER PRIMARY TOTAL PRIMARIES
YEAR
1970 NaN 0.00632 0.08224 NaN NaN NaN 28.34498 3.05959 0.14209 31.63521
1971 NaN 0.03079 0.08707 NaN NaN NaN 27.81478 3.28909 0.15295 31.37467
1972 NaN 0.06811 0.08810 NaN NaN NaN 27.51668 3.71635 0.20727 31.59650
1973 NaN 0.09330 0.07018 NaN NaN NaN 26.98772 4.02993 0.22213 31.40326
1974 NaN 0.22764 0.09930 NaN NaN NaN 26.47921 4.02503 0.25387 31.08505
In [48]:
# Clasify the energy sources:

group_CARBON=['COAL','FIREWOOD','COKE','CHARCOAL']

group_PETROLIUM= ['OIL','GASOLINE/ALCOHOL','KEROSENE/JET FUEL','DIESEL OIL','FUEL OIL']

group_NATURAL_GAS=['NATURAL GAS','LPG','GASES']

group_RENEWABLE=['SUGARCANE AND PRODUCTS']

group_ELECTRICITY= ['ELECTRICITY']

group_OTHER=['OTHER PRIMARY','OTHER SECONDARY','NON-ENERGY']
In [49]:
df_group=pd.DataFrame()

df_group['group_CARBON']=final_consumption_data[group_CARBON].sum(axis=1)
df_group['group_PETROLIUM']=final_consumption_data[group_PETROLIUM].sum(axis=1)
df_group['group_NATURAL_GAS']=final_consumption_data[group_NATURAL_GAS].sum(axis=1)
df_group['group_ELECTRICITY']=final_consumption_data[group_ELECTRICITY]
df_group['group_RENEWABLE']=final_consumption_data[group_RENEWABLE].sum(axis=1)
df_group['group_OTHER']=final_consumption_data[group_OTHER].sum(axis=1)
In [50]:
df_group.plot()
Out[50]:
<AxesSubplot:xlabel='YEAR'>
In [51]:
df_group.columns
Out[51]:
Index(['group_CARBON', 'group_PETROLIUM', 'group_NATURAL_GAS',
       'group_ELECTRICITY', 'group_RENEWABLE', 'group_OTHER'],
      dtype='object')
In [52]:
df_group_percentages = df_group.div(df_group.sum(axis=1) ,axis=0) * 100
df_group_percentages['TOTAL']=100
df_group_percentages = df_group_percentages.round(0)
In [53]:
df_group['TOTAL']=final_consumption_data[['TOTAL']]
# Round the values of each column in the dataframe with 2 decimals
df_group = df_group.applymap(lambda x: round(x, 2) if isinstance(x, (int, float)) else x)
In [54]:
# Initializing an empty DataFrame for the final combined table
final_df = pd.DataFrame()

# Iterating through each column in the original dataframe
for col in df_group.columns:
    # Constructing the paired data for 'Value' and 'Percentage'
    paired_data = pd.concat([df_group[col], df_group_percentages[col]], axis=1)
    paired_data.columns = pd.MultiIndex.from_product([[col], ['[Mtoe]', '%']])
    # Adding the paired data to the final dataframe
    final_df = pd.concat([final_df, paired_data.round(2)], axis=1)


# Save this df for further usage

file_path='../datas/Data_forecast/predict_percentages.xlsx'
    
final_df.to_excel(file_path)

# Show the final_df
final_df.tail(10)
Out[54]:
group_CARBON group_PETROLIUM group_NATURAL_GAS group_ELECTRICITY group_RENEWABLE group_OTHER TOTAL
[Mtoe] % [Mtoe] % [Mtoe] % [Mtoe] % [Mtoe] % [Mtoe] % [Mtoe] %
YEAR
2012 32.59 14.0 94.41 41.0 22.43 10.0 40.62 18.0 17.89 8.0 21.84 10.0 229.77 100
2013 31.71 14.0 98.04 42.0 22.38 10.0 41.87 18.0 17.25 7.0 22.45 10.0 233.71 100
2014 32.41 14.0 101.04 42.0 22.28 9.0 43.02 18.0 16.16 7.0 22.86 10.0 237.76 100
2015 32.23 14.0 99.11 43.0 22.09 10.0 42.24 18.0 15.53 7.0 21.34 9.0 232.54 100
2016 29.82 13.0 96.53 42.0 21.22 9.0 42.17 18.0 17.57 8.0 21.34 9.0 228.65 100
2017 32.04 14.0 98.29 42.0 21.61 9.0 42.76 18.0 17.22 7.0 21.10 9.0 233.00 100
2018 33.18 15.0 95.91 42.0 21.94 10.0 43.49 19.0 13.25 6.0 20.89 9.0 228.67 100
2019 32.31 14.0 99.31 43.0 20.62 9.0 44.00 19.0 13.19 6.0 20.87 9.0 230.30 100
2020 31.75 14.0 92.49 41.0 19.22 8.0 43.70 19.0 18.09 8.0 21.71 10.0 226.96 100
2021 33.63 14.0 99.15 42.0 21.15 9.0 45.63 19.0 15.42 7.0 22.22 9.0 237.19 100
In [55]:
df_2021=df_group_percentages.loc[2021, :]

df_2021= df_2021.drop('TOTAL') 
In [56]:
# Create a pie chart
plt.figure(figsize=(3, 3))  # Adjust the figure size as needed
plt.pie(df_2021, labels=df_2021.index, autopct='%1.1f%%', startangle=90)
plt.title('Contribution of energy sources (2021)')  # Add a title
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

# Show the pie chart
plt.show()
In [57]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(18, 6))

df_group.drop(columns=['TOTAL']).plot.area(stacked=True, alpha=0.7, ax=ax1)
ax1.set_xlabel('Year')
ax1.set_ylabel('Energy Consumption [Mtoe]')
ax1.set_title('Energy Consumption Over the Years')
ax1.legend(loc='upper left', bbox_to_anchor=(1.0, 1.0))

df_group_percentages.drop(columns=['TOTAL']).plot.area(stacked=True, alpha=0.7, ax=ax2)

ax2.set_xlabel('Year')
ax2.set_ylabel(' Energy Consumption  (%)')
ax2.set_title('Percentage of  Energy Consumption Over the Years')
ax2.legend(loc='upper left', bbox_to_anchor=(1.0, 1.0))

plt.tight_layout()
plt.show()

If we assume that the distribuition of energy are keeping as recent years, we can estimate the energy sources are going to use in the future

In [58]:
df1=pd.DataFrame()
df1=pd.concat([df1,df_group])
df1.columns

df1.tail()
Out[58]:
group_CARBON group_PETROLIUM group_NATURAL_GAS group_ELECTRICITY group_RENEWABLE group_OTHER TOTAL
YEAR
2017 32.04 98.29 21.61 42.76 17.22 21.10 233.00
2018 33.18 95.91 21.94 43.49 13.25 20.89 228.67
2019 32.31 99.31 20.62 44.00 13.19 20.87 230.30
2020 31.75 92.49 19.22 43.70 18.09 21.71 226.96
2021 33.63 99.15 21.15 45.63 15.42 22.22 237.19
In [59]:
df2=pd.DataFrame()
df2['TOTAL']=df_C_final['TOT'].loc[2022:]
df2.index.name='YEAR'
In [60]:
for col in df1.columns[:-1]:
    df2[col]=df2['TOTAL']*df_group_percentages.loc[2021, col]/100
In [61]:
df2.head()
Out[61]:
TOTAL group_CARBON group_PETROLIUM group_NATURAL_GAS group_ELECTRICITY group_RENEWABLE group_OTHER
YEAR
2022 237.73 33.2822 99.8466 21.3957 45.1687 16.6411 21.3957
2023 243.20 34.0480 102.1440 21.8880 46.2080 17.0240 21.8880
2024 244.02 34.1628 102.4884 21.9618 46.3638 17.0814 21.9618
2025 249.46 34.9244 104.7732 22.4514 47.3974 17.4622 22.4514
2026 249.85 34.9790 104.9370 22.4865 47.4715 17.4895 22.4865
In [62]:
df_scenarios= pd.concat([df1, df2], axis=0)
df_scenarios
file_path='../datas/Data_forecast/predict_scenarios.xlsx'
df_scenarios.to_excel(file_path)
In [63]:
# Create the stacked bar plot
data = df_scenarios.loc[2010:].drop(columns=['TOTAL'])  # Remove inplace=True
ax = data.plot(kind='bar', stacked=True, figsize=(10, 6))

# Customize the plot
plt.xlabel('Year')
plt.ylabel('Energy Contribution [Mtoe]')
plt.title('Energy Contribution by Source')
plt.legend(title='Source', bbox_to_anchor=(1.05, 1), loc='upper left')

# Show the plot
plt.show()

Total Supply¶

In [64]:
dt1=df_concat[df_concat.SECTOR=='TOTAL SUPPLY'][['TOTAL PRIMARIES', 'TOTAL SECUNDARIES']]
In [65]:
dt1.plot(kind='bar', stacked=True, title="Total Primaries and Secundaries Energies", ylabel= "values [Mtoe]")
Out[65]:
<AxesSubplot:title={'center':'Total Primaries and Secundaries Energies'}, xlabel='YEAR', ylabel='values [Mtoe]'>
In [66]:
dt_pri=df_concat[df_concat.SECTOR=='TOTAL SUPPLY'][df_concat.columns[1:10]]
dt_pri.NUCLEAR=np.abs(dt_pri.NUCLEAR)
dt_pri_percentages= dt_pri.div(dt_pri.sum(axis=1) ,axis=0) * 100
dt_pri_percentages.plot.area(stacked=True, alpha=0.9, title= "Percentage of Primaries Energies", ylabel= "%")
Out[66]:
<AxesSubplot:title={'center':'Percentage of Primaries Energies'}, xlabel='YEAR', ylabel='%'>
In [67]:
dt_sec=df_concat[df_concat.SECTOR=='TOTAL SUPPLY'][df_concat.columns[11:-3]]

dt_sec_percentages= dt_sec.div(dt_sec.sum(axis=1) ,axis=0) * 100

dt_sec_percentages.plot.area(stacked=True, alpha=0.9,title= "Percentage of Secundaries Energies", ylabel= "%")
Out[67]:
<AxesSubplot:title={'center':'Percentage of Secundaries Energies'}, xlabel='YEAR', ylabel='%'>

Conclusion:¶

  • This project aimed to predict energy consumption in Brazil using an ARIMA model, which was trained on a dataset containing a 52-year record of the energy balance matrix published on the Olade website. The dataset consists of time series data collected annually. The ARIMA model was trained with 90% of the data for training and 10% for validation, and it predicts observations for an additional 20 years, extending up to the year 2042.

  • The model was successfully built by selecting the parameters (p, d, q) and using the mean absolute percentage error (MAPE) optimization algorithm to assess the model's performance. The results have shown good performance when applied and can be compared with the results published on the official website. It's worth noting that this evaluation depends heavily on the choice of parameters (p, d, q). Different parameter selections could lead to different results.